A comparative study of root-based and stem-based approaches for measuring the similarity between arabic words for arabic text mining applications

نویسندگان

  • Hanane Froud
  • Abdelmonaim Lachkar
  • Said Alaoui Ouatik
چکیده

Representation of semantic information contained in the words is needed for any Arabic Text Mining applications. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to Arabic corpus: Rootbased (Stemming), and Stem-based (Light Stemming) approaches for measuring the similarity between Arabic words with the well known abstractive model -Latent Semantic Analysis (LSA)with a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient, and the Pearson Correlation Coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the Stem-based approach outperformed the Root-based one because this latter affects the words meanings.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is one of the considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...

متن کامل

Qualitative and Quantitative Examination of Text Type Readabilities: A Comparative Analysis

This study compared 2 main approaches to readability assessment. Thequantitative approach applied idea density based on part of speech tagging andcompared 3 sets of text types (i.e., narrative, expository, and argumentative) withrespect to their ease of reading. The qualitative approach was done throughdeveloping questionnaires measuring intermediate EFL learners’ perceptions oncontent, motivat...

متن کامل

‘Repetition’ in Arabic-English Translation: The case of Adrift on the Nile

Abstract This study investigates ‘repetition’ in the English translation of the Arabic Novel, Adrift on the Nile (1993). It aims to explore the communicative functions of ‘repetition’ and to see if these functions have been maintained or lost in the process of translating the Novel. In addition, it seeks to find the translation strategies used in rendering ‘repetition’. To achieve this aim, a d...

متن کامل

‘Repetition’ in Arabic-English Translation: The case of Adrift on the Nile

Abstract This study investigates ‘repetition’ in the English translation of the Arabic Novel, Adrift on the Nile (1993). It aims to explore the communicative functions of ‘repetition’ and to see if these functions have been maintained or lost in the process of translating the Novel. In addition, it seeks to find the translation strategies used in rendering ‘repetition’. To achieve this aim, a d...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012